refactor: remove redundant partitioned_by_file_group file scan field #23189

Open
Phoenix500526 wants to merge 2 commits into apache:main from Phoenix500526:issue/23099
@Phoenix500526 Phoenix500526 commented Jun 25, 2026

Contributor

Which issue does this PR close?

Closes #23099.

Rationale for this change

FileScanConfig had two overlapping ways to declare a file scan's output
partitioning: the partitioned_by_file_group bool and output_partitioning.

The bool is just a lazy shorthand for one specific output_partitioning value
(Partitioning::Hash over the partition columns), and every place that consumed
it (output_partitioning(), repartitioned(), create_sibling_state()) already
checked output_partitioning.is_some() || partitioned_by_file_group. Keeping both
is redundant and the ListingTable builder ended up setting both. This PR makes
output_partitioning the single source of truth.
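
To make the overlap concrete, here is a minimal sketch of the two models using simplified stand-in types. `OldScanConfig`, `NewScanConfig`, and the string-based `Partitioning` are hypothetical and are not the real DataFusion definitions; the precedence between the two old fields is an assumption made for illustration.

```rust
/// Simplified stand-in for DataFusion's `Partitioning`. The real enum holds
/// physical expressions; column names are used here for brevity.
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    Hash(Vec<String>, usize),
    UnknownPartitioning(usize),
}

/// Old model (hypothetical): two fields can both claim the output partitioning.
pub struct OldScanConfig {
    pub output_partitioning: Option<Partitioning>,
    pub partitioned_by_file_group: bool,
    pub partition_cols: Vec<String>,
    pub file_group_count: usize,
}

impl OldScanConfig {
    /// Every consumer had to consult both fields.
    pub fn declares_partitioning(&self) -> bool {
        self.output_partitioning.is_some() || self.partitioned_by_file_group
    }

    /// The bool is expanded lazily into a Hash over the partition columns.
    /// Assumption: an explicit `output_partitioning` takes precedence.
    pub fn output_partitioning(&self) -> Partitioning {
        if let Some(p) = &self.output_partitioning {
            return p.clone();
        }
        if self.partitioned_by_file_group && !self.partition_cols.is_empty() {
            return Partitioning::Hash(self.partition_cols.clone(), self.file_group_count);
        }
        Partitioning::UnknownPartitioning(self.file_group_count)
    }
}

/// New model (hypothetical): one field is the single source of truth.
pub struct NewScanConfig {
    pub output_partitioning: Option<Partitioning>,
    pub file_group_count: usize,
}

impl NewScanConfig {
    pub fn output_partitioning(&self) -> Partitioning {
        self.output_partitioning
            .clone()
            .unwrap_or(Partitioning::UnknownPartitioning(self.file_group_count))
    }
}
```

In this sketch, an old config with only the bool set and a new config with the equivalent `Hash` stored up front report the same partitioning, which is the equivalence the refactor relies on.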

What changes are included in this PR?

Following the issue's first option ("Remove partitioned_by_file_group"):

  • Remove FileScanConfig::partitioned_by_file_group, the corresponding
    FileScanConfigBuilder field, and the with_partitioned_by_file_group builder
    method.
  • ListingTable::scan now derives the partition-column Partitioning::Hash
    itself (once its file groups are finalized, so the partition count is correct)
    and passes it through the existing with_output_partitioning. The previous
    with_output_partitioning(declared) + with_partitioned_by_file_group(...)
    double-set is collapsed into one branch.
  • hash_partitioning_from_partition_fields is made pub so ListingTable
    (a separate crate) can reuse the derivation instead of duplicating the
    column-index resolution.
  • proto already round-trips output_partitioning, so no behavior is lost: the
    now-vestigial partitioned_by_file_group wire field is left unset on write and
    ignored on read. The field is kept in the .proto definition for backward
    compatibility.
  • output_partitioning() / create_sibling_state() / repartitioned() now key
    solely off output_partitioning.
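
The derivation that ListingTable::scan now performs can be sketched roughly as follows. This is an illustration only: `hash_partitioning_sketch`, `Column`, and the slice-of-names schema are hypothetical stand-ins, and the real `hash_partitioning_from_partition_fields` signature is not reproduced here. The partition count is passed in after the file groups are final, which is why the call has to happen late in `scan`.

```rust
/// Hypothetical stand-in for a physical column reference.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub index: usize,
}

/// Simplified stand-in for DataFusion's `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    Hash(Vec<Column>, usize),
}

/// Resolve each partition field to its index in the scan's output schema and
/// build a Hash partitioning with one partition per file group.
/// Returns `None` when there are no partition fields or when a field cannot
/// be found in the schema (an assumption about the failure mode).
pub fn hash_partitioning_sketch(
    output_schema: &[&str],
    partition_fields: &[&str],
    file_group_count: usize,
) -> Option<Partitioning> {
    if partition_fields.is_empty() {
        return None;
    }
    let columns = partition_fields
        .iter()
        .map(|field| {
            output_schema
                .iter()
                .position(|name| name == field)
                .map(|index| Column {
                    name: (*field).to_string(),
                    index,
                })
        })
        // Short-circuits to `None` if any field failed to resolve.
        .collect::<Option<Vec<_>>>()?;
    Some(Partitioning::Hash(columns, file_group_count))
}
```

Under these assumptions, the result would be handed to `with_output_partitioning` as-is, so no second flag is needed to say "this scan is partitioned by file group".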

Are these changes tested?

Yes — by existing tests, updated to the new single-field model:

  • datafusion-datasource: test_output_partitioning_with_partition_columns,
    test_output_partitioning_no_partition_columns,
    test_declared_output_partitioning_projects_with_scan, and the file_stream
    work-stealing test morsel_partitioned_by_file_group_keeps_files_local (which
    verifies that a declared output partitioning keeps each stream's files local).
  • datafusion-proto: roundtrip_parquet_exec_output_partitioning (and the other
    roundtrip_parquet_exec_* cases) cover the partitioning round-trip. The old
    roundtrip_parquet_exec_partitioned_by_file_group test exercised the removed
    API and is dropped, as its coverage is subsumed by the output_partitioning
    round-trip test.
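
As a rough picture of what the round-trip coverage checks, the sketch below models the wire handling with hypothetical stand-in types (`WireScanConfig`, `to_wire`, and `from_wire` are not the real datafusion-proto API). It assumes the behavior described above: the vestigial bool is never written and never read.

```rust
/// Simplified stand-in for DataFusion's `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    Hash(Vec<String>, usize),
}

/// Hypothetical in-memory scan config.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanConfig {
    pub output_partitioning: Option<Partitioning>,
}

/// Hypothetical wire message; not the generated proto struct.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct WireScanConfig {
    pub output_partitioning: Option<(Vec<String>, u64)>,
    /// Vestigial field, kept so that older messages still decode.
    pub partitioned_by_file_group: bool,
}

pub fn to_wire(config: &ScanConfig) -> WireScanConfig {
    let output_partitioning = match &config.output_partitioning {
        Some(Partitioning::Hash(cols, n)) => Some((cols.clone(), *n as u64)),
        None => None,
    };
    WireScanConfig {
        output_partitioning,
        // Left at its default: the bool is no longer written.
        partitioned_by_file_group: false,
    }
}

pub fn from_wire(wire: &WireScanConfig) -> ScanConfig {
    // `wire.partitioned_by_file_group` is deliberately not consulted.
    ScanConfig {
        output_partitioning: wire
            .output_partitioning
            .as_ref()
            .map(|(cols, n)| Partitioning::Hash(cols.clone(), *n as usize)),
    }
}
```

In this model a config survives `to_wire` followed by `from_wire` unchanged, while a message that sets only the old bool decodes with no declared partitioning.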

All of the above pass, along with cargo fmt --all --check and
cargo clippy --all-targets --all-features -- -D warnings for the affected
crates.

Are there any user-facing changes?

Yes, public API changes:

  • Removed: the public FileScanConfig::partitioned_by_file_group field and the
    FileScanConfigBuilder::with_partitioned_by_file_group method. Callers should
    set with_output_partitioning(Some(Partitioning::Hash(..))) instead (or use the
    now-public hash_partitioning_from_partition_fields helper).
  • Added: hash_partitioning_from_partition_fields is now pub.
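
For callers, the migration might look like the sketch below. The builder here is a hypothetical stand-in that only mirrors the shape of `with_output_partitioning(Some(Partitioning::Hash(..)))` mentioned above; the real FileScanConfigBuilder has more fields, and the real `Partitioning::Hash` takes physical expressions rather than column names.

```rust
/// Simplified stand-in for DataFusion's `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
pub enum Partitioning {
    Hash(Vec<String>, usize),
}

/// Hypothetical scan config with only the field relevant to this change.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanConfig {
    pub output_partitioning: Option<Partitioning>,
}

/// Hypothetical builder mirroring the shape of the remaining setter.
#[derive(Debug, Default)]
pub struct ScanConfigBuilder {
    output_partitioning: Option<Partitioning>,
}

impl ScanConfigBuilder {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn with_output_partitioning(mut self, partitioning: Option<Partitioning>) -> Self {
        self.output_partitioning = partitioning;
        self
    }

    pub fn build(self) -> ScanConfig {
        ScanConfig {
            output_partitioning: self.output_partitioning,
        }
    }
}

/// Before: `builder.with_partitioned_by_file_group(true)`.
/// After: state the Hash partitioning explicitly.
pub fn partition_grouped_scan(partition_cols: &[&str], file_group_count: usize) -> ScanConfig {
    let columns: Vec<String> = partition_cols.iter().map(|c| c.to_string()).collect();
    ScanConfigBuilder::new()
        .with_output_partitioning(Some(Partitioning::Hash(columns, file_group_count)))
        .build()
}
```

The main thing a migrating caller has to supply is the partition count, which the old bool let them skip; it should match the number of file groups the scan ends up with.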

Query results, optimizer decisions (e.g. eliding RepartitionExec), and the
serialized (proto) wire format are unchanged. There is one display-only
change: EXPLAIN now renders output_partitioning=Hash(...) on DataSourceExec
for partition-grouped scans. The scan already produced that partitioning before
(it was derived lazily inside output_partitioning()); it is now stored on the
output_partitioning field and therefore shown. The
repartition_subset_satisfaction and preserve_file_partitioning slt expected
plans are updated accordingly.
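
The reason the text changes while the behavior does not can be sketched with a toy Display impl. The types and the exact output format below are illustrative assumptions, not the real DataSourceExec rendering: the point is only that an entry is printed when the field is stored and omitted when it is `None`.

```rust
use std::fmt;

/// Simplified stand-in for DataFusion's `Partitioning`.
pub enum Partitioning {
    Hash(Vec<String>, usize),
}

/// Hypothetical scan config; the display format is illustrative only.
pub struct ScanConfig {
    pub file_group_count: usize,
    pub output_partitioning: Option<Partitioning>,
}

impl fmt::Display for ScanConfig {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "DataSourceExec: file_groups={}", self.file_group_count)?;
        // Only a stored partitioning is rendered. A value that is derived
        // lazily elsewhere never reaches this branch, which is why the old
        // bool-based scans showed nothing here.
        if let Some(Partitioning::Hash(cols, n)) = &self.output_partitioning {
            write!(f, ", output_partitioning=Hash([{}], {})", cols.join(", "), n)?;
        }
        Ok(())
    }
}
```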

cargo-semver-checks will flag the removals as breaking, which is expected for
this cleanup.

@github-actions github-actions Bot added the catalog, proto, and datasource labels Jun 25, 2026

github-actions Bot commented Jun 25, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-catalog-listing v54.0.0 (current)
       Built [  34.926s] (current)
     Parsing datafusion-catalog-listing v54.0.0 (current)
      Parsed [   0.010s] (current)
    Building datafusion-catalog-listing v54.0.0 (baseline)
       Built [  34.538s] (baseline)
     Parsing datafusion-catalog-listing v54.0.0 (baseline)
      Parsed [   0.010s] (baseline)
    Checking datafusion-catalog-listing v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.092s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  70.588s] datafusion-catalog-listing
    Building datafusion-datasource v54.0.0 (current)
       Built [  28.565s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  28.556s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.287s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  FileScanConfigBuilder::with_partitioned_by_file_group, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/1c6f8a0256807fdae23280a2d0c22cfdac108e76/datafusion/datasource/src/file_scan_config/mod.rs:520

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field partitioned_by_file_group of struct FileScanConfig, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/1c6f8a0256807fdae23280a2d0c22cfdac108e76/datafusion/datasource/src/file_scan_config/mod.rs:211

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  58.498s] datafusion-datasource
    Building datafusion-proto v54.0.0 (current)
       Built [  45.319s] (current)
     Parsing datafusion-proto v54.0.0 (current)
      Parsed [   0.015s] (current)
    Building datafusion-proto v54.0.0 (baseline)
       Built [  45.347s] (baseline)
     Parsing datafusion-proto v54.0.0 (baseline)
      Parsed [   0.016s] (baseline)
    Checking datafusion-proto v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.296s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  91.959s] datafusion-proto
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 140.741s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 140.075s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.089s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 283.454s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change label Jun 25, 2026
`FileScanConfig` had two overlapping ways to declare file scan output
partitioning: the `partitioned_by_file_group` bool and `output_partitioning`.
Collapse them onto `output_partitioning` as the single source of truth.

- Remove the `partitioned_by_file_group` field, the builder field, and the
  `with_partitioned_by_file_group` builder method.
- `ListingTable` now derives the partition-column `Partitioning::Hash` once its
  file groups are finalized and passes it via `with_output_partitioning`;
  `hash_partitioning_from_partition_fields` is made `pub` for this.
- proto already round-trips `output_partitioning`, so the now-vestigial wire
  bool is left unset on write and ignored on read (the proto field is kept for
  backward compatibility).

Closes apache#23099.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
After collapsing `partitioned_by_file_group` onto `output_partitioning`, the
declared Hash partitioning is now stored on the scan and therefore rendered by
`DataSourceExec`'s Display. Update the affected sqllogictest expected plans
accordingly. Behavior is unchanged; only the EXPLAIN text gains an
`output_partitioning=Hash(...)` entry on partition-grouped scans.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
@github-actions github-actions Bot added the sqllogictest label Jun 26, 2026
Labels

auto detected api change, catalog, datasource, proto, sqllogictest

Development

Successfully merging this pull request may close these issues.

Remove redundant partitioned_by_file_group file scan field

1 participant